Project

Overview

This project utilizes the notes on optimal recovery - data assimilation. You will ultimately implement Algorithm 1 from those notes using Pandas and cvxpy, and will use your implementation to estimate both the global mean temperature and the mean temperature at a given location. During this project you will become familiar with manipulating and using Pandas DataFrames, including those with MultiIndex's.

Instructions

  1. Run the code blocks provided to load the temperature data and station inventory.

  2. Complete the function compute_monthly_temperature_means to return the global monthly mean temperature by simple averaging.

  3. Complete the function compute_annual_temperature_means to return the global annual mean temperature by simple averaging.

  4. Complete the functions get_operating_stations_by_year and get_operating_stations_by_month to identify the stations operating in the given date ranges.

  5. Complete the function get_station_locations to return the latitudes and longitudes of the passed set of stations from the inventory.

  6. Complete the function get_valid_station_temps_by_month and get_valid_station_temps_by_year. These functions will give access to the valid monthly and annual temperature data for all stations operating during the specified dates.

  7. Complete the function setup_system_global_mean_pwc to create the system matrix and right hand side for the minimization. The system matrix will have the number of rows equal to the number of basis functions, e.g. the number of cells (boxes) in our grid, and the number of columns equal to the number of measurements. The values in this matrix will be either zero or one. If an entry at (row, col) is one, then the measurement corresponding to col exists within the box corresponding to row. The entries in the right hand side should be equal to the area of the grid cell (box) divided by the total surface area. For the surface area of a grid cell (box) you can use the formula $\frac{\pi}{180} \cdot E^2 \cdot |sin(b_0) - sin(b_1)| \cdot |b_2 - b_3|$, where $E = 6375$km is an approximate radius of the Earth and $[b_0, b_1, b_2, b_3]$ are the north, south latitudes and west, east longitudes respectively. Also, complete the box_contains function as it will be helpful in your implementation of setup_system_global_mean_pwc and setup_system_location_mean_pwc later in the project. The function simply returns True if the passed latitude and longitude are "within" the box and False otherwise. A lat,lon pair is within a box if south < lat <= north and west < lon <= east. Do note that with this test we exclude any station that is located at the south pole, which we will ignore for this project.

  8. Complete the function optimize_weights_pwc using cvxpy to perform the optimization from Algorithm 1 in the notes with piece-wise constant basis.

  9. Complete the function optimize_global_mean_pwc. This function will compute either the global mean temperature for a given month in a given year, or if month is None then it will compute the global annual mean temperature for the given year. This function should leverage the functions you have already implemented and should return the global temperature, $\sum_{i=1}^m a_i \cdot y_i$.

  10. For this function you may copy your implementation from setup_system_global_mean_pwc and modify it slightly to implement setup_system_location_mean_pwc. Instead of the global mean temperature, our quantity of interest will be a location (or grid cell in this instance). The matrix $B$ should be the same, but the entries in the righthand side $c$ should only be set if the corresponding basis function is non-zero at eval_location.

  11. For this function you may copy your implementation from optimize_global_mean_pwc and modify it slightly to implement optimize_location_mean_pwc as it should only differ by the use of setup_system_location_mean_pwc.

  12. Complete the function get_station_data_by_state to identify all stations in a given state and return new DataFrame's containing only those stations and their data. Be aware that the inventory may contain stations that do not appear in the data, and the returned DataFrames should not contain station data for stations that do not appear in both returned DataFrames.

GISTEMP Data

We have already dowloaded and preprocessed the data for this project, which can be found in the data subdirectory. However if you wish to explore the raw data and the software used to process it then download gistemp4.0.tar.gz from the GISS Surface Temperature Analysis website. The website and archive contain instructions on how to install and run the Python code.

If you choose to run GISTEMP, while running if you receive an error regarding a missing file, namely: gistemp4.0/tmp/input/Ts.strange.v4SCAR.list.IN_full, this is because the authors have a typo in gistemp4.0/steps/read_config.py. The letters SCAR should not be in the file name. You can either correct the read_config.py source code or change the file name gistemp4.0/tmp/input/Ts.strange.v4SCAR.list.IN_full to gistemp4.0/tmp/input/Ts.strange.v4.list.IN_full.

Submitting Your Assignment

You will submit your completed assignment in two formats:

Jupyter Notebook (.ipynb)

You may directly use this notebook to complete your assignment or you may use an external editor/IDE of your choice. However, to submit your code please ensure that your code works in this notebook.

HTML (.html)

To create an HTML file for your assignment simply select File > Download as > HTML (.html) from within the Jupyter Notebook.

Both files should be uploaded to Canvas.

Installing Required Packages

For this project we will be using several modules, some of which are not included in a standard installation of Python and/or Anaconda.

For Linux and Mac users the following instructions can be performed in any shell of your choosing. For Windows users the instructions are written for use in the Anaconda Powershell Prompt (anaconda3) application.

We will use the conda package installer to help us acquire and install necessary packages. One could also use pip the Python Package Installer, but this may be more involved depending on your system/setup. A common error made by those new to anaconda and/or python is to install pacakges using the wrong pacakge manager, as there may be multiple versions of conda, python etc on their system. Before installing the packages below be sure to first identify the correct conda (or pip). For Linux and Mac we will assume that Anaconda was installed locally in your home directory and the bin directory is located at ~/opt/anaconda3/bin/ (Verify the location on your system in the event your installation/setup is different). For Windows users we will assume you are using the Anaconda Powershell Prompt (anaconda3) application. The instructions below assume that the correct conda can be found in your path, however you can always use the full or relative path for your system (e.g. ~/opt/anaconda3/bin/conda).

Install the following packages before attempting to complete project.

CVX

  1. (Windows Only) This package requires the Visual Studio build tools for Python 3. Download and install the package. Then modify the Visual Studio Build Tools in the Workloads tab by clicking the checkbox for C++ build tools and then clicking the install/modify button in the bottom right corner of the window. See instructions from cvxpy.org.

  2. conda install -c conda-forge cvxpy

Cartopy

Anaconda users can use the following:

  1. conda install --channel conda-forge cartopy

Alternatively, pip can be used. However, currently cartopy requires that shapely be installed from source. The second and third step here uninstall shapely and perform a reinstall from source.

  1. pip install cartopy

  2. pip uninstall shapely

  3. pip install --no-binary :all: shapely

If errors are encountered using pip above (e.g. geos requires proj 8.0.0), then the installation may become more involved.

MacOS
    MacPorts Users:  
        sudo port install py-cartopy

    HomeBrew -> see https://scitools.org.uk/cartopy/docs/latest/installing.html
        brew install proj geos
        pip install --upgrade pyshp
        # shapely needs to be built from source to link to geos. If it is already
        # installed, uninstall it by: pip uninstall shapely
        pip install shapely --no-binary shapely
        brew install pkg-config
        export PKG_CONFIG_PATH=/usr/local/bin/pkgconfig
        pip install cartopy

Windows
    Instructions outside of Anaconda are not given here.  
    If anyone falls into this category please discuss with the instructor

Linux
    Install proj (https://proj.org/install.html#install) <- look for your specific distro
    Install GEOS (https://libgeos.org/usage/install/)
    pip install --upgrade pyshp
    # shapely needs to be built from source to link to geos. If it is already
    # installed, uninstall it by: pip3 uninstall shapely
    pip install shapely --no-binary shapely
    pip install cartopy



Gurobi Solver (optional)

  1. Acquire an academic license from the Gurobi website. See the Academic License page for instructions on downloading and installing the Gurobi Optimizer on your system.

  2. conda config --add channels http://conda.anaconda.org/gurobi

  3. conda install gurobi

Imports

Helper Functions

The following code block contains functions for loading data, creating grids and plotting.

Load Temperature Data

The following block loads the processed GISTEMP data, and organizes it using a Pandas DataFrame.

Load Station Inventory

The following block loads the inventory of stations, and creates a Pandas DataFrame containing their positions and names.

Your Solution (2): Compute Monthly Means

Complete the function compute_monthly_temperature_means to return the global monthly mean temperature by simple averaging.

Test Code

Your Solution (3): Compute Annual Means

Complete the function compute_annual_temperature_means to return the global annual mean temperature by simple averaging.

Test Code

Your Solution (4): Identify Operating Stations

Complete the functions get_operating_stations_by_year and get_operating_stations_by_month to identify the stations operating in the given date ranges.

Your Solution (5): Get Station Locations

Complete the function get_station_locations to return the latitudes and longitudes of the passed set of stations from the inventory.

Test Code

Run the following code block to demonstrate your implementation and plot all station locations in the inventory.

The following code block uses your implementation to identify a set of stations operating between begin_year and end_year and plots their locations.

The following code block uses your implementations of get_operating_stations_by_year, get_operating_stations_by_month and get_station_locations to create an animation of the station locations through time. You can modify the month, begin_year and end_year variables to modify which stations are displayed. This code block may take several minutes - it is recommended that you reduce the number of years used in the animation when testing, and only return to the full range (1880 - 2020) once you are satisfied with your implementation.

Your Solution (6): Temperature Data Access

Complete the function get_valid_station_temps_by_month and get_valid_station_temps_by_year. These functions will give access to the valid monthly and annual temperature data for all stations operating during the specified dates.

Test Code

The following code block uses your implementation to find stations operating during a given month and year and plot their locations colored by the mean temperature during that month.

Your Solution (7): Minimization Setup - Global Quantity of Interest

Complete the function setup_system_global_mean_pwc to create the system matrix and right hand side for the minimization. The system matrix will have the number of rows equal to the number of basis functions, e.g. the number of cells (boxes) in our grid, and the number of columns equal to the number of measurements. The values in this matrix will be either zero or one. If an entry at (row, col) is one, then the measurement corresponding to col exists within the box corresponding to row. The entries in the right hand side should be equal to the area of the grid cell (box) divided by the total surface area. For the surface area of a grid cell (box) you can use the formula $\frac{\pi}{180} \cdot E^2 \cdot |sin(b_0) - sin(b_1)| \cdot |b_2 - b_3|$, where $E = 6375$km is an approximate radius of the Earth and $[b_0, b_1, b_2, b_3]$ are the north, south latitudes and west, east longitudes respectively.

The box_contains function will be helpful in your implementation of setup_system_global_mean_pwc and setup_system_location_mean_pwc later in the project. The function simply returns True if the passed latitude and longitude are "within" the box and False otherwise. A lat,lon pair is within a box if south < lat <= north and west < lon <= east. Do note that with this test we exclude any station that is located at the south pole, which we will ignore for this project.

Your Solution (8): Minimzation

Complete the function optimize_weights_pwc using cvxpy to perform the optimization from Algorithm 1 in the notes with piece-wise constant basis.

Your Solution (9): Driver - Global Quantity of Interest

Complete the function optimize_global_mean_pwc. This function will compute either the global mean temperature for a given month in a given year, or if month is None then it will compute the global annual mean temperature for the given year. This function should leverage the functions you have already implemented and should return the global temperature, $\sum_{i=1}^m a_i \cdot y_i$.

Test Code

The following code block runs your optimization for a given month and year.

Comparison: Naive Mean vs Recovery

The following code blocks will use your implementations above to plot various values to compare the results of the naive mean and recovery using piecewise constants. Note that the recovery optimization can take some time to complete for all years in the data set.

Your Solution (10): Minimization Setup - Local Quantity of Interest

For this function you may copy your implementation from setup_system_global_mean_pwc and modify it slightly to implement setup_system_location_mean_pwc. Instead of the global mean temperature, our quantity of interest will be a location (or grid cell in this instance). The matrix $B$ should be the same, but the entries in the righthand side $c$ should only be set if the corresponding basis function is non-zero at eval_location.

Your Solution (11): Driver - Local Quantity of Interest

For this function you may copy your implementation from optimize_global_mean_pwc and modify it slightly to implement optimize_location_mean_pwc as it should only differ by the use of setup_system_location_mean_pwc.

Your Solution (12): Identify Texas Stations

Complete the function get_station_data_by_state to identify all stations in a given state and return new DataFrame's containing only those stations and their data. Be aware that the inventory may contain stations that do not appear in the data, and the returned DataFrames should not contain station data for stations that do not appear in both returned DataFrames.

Hint: The station names in the inventory include the US state name in which they exist at the very end of their name. When identifying stations in the state of Texas, you should find all rows whose value in the name column ends with TEXAS. For those using regular expressions, the end of the line is represented by $. Be aware that some station names may include TEXAS elsewhere in their name, but are not in Texas.

Test Code

This block uses plots the stations operating in Texas from any time.

The following block creates a grid for our piecewise constant basis over the state of texas and plots this grid along with the Texas stations and the location of College Station.

We will now use your optimization functions to evaluate the temperature in College Station using only those stations found in Texas during month and year below. According to Wunderground the temperature in College Staion for August 2010 is approximately 87°F (30.55°C).

I am sorry but I could not quite finish this project. But I am turning in what I have so far as I cannot take any more time.